Exploring Unlabeled Data in Multiple Aspects for Semi-Supervised MRI Segmentation

Background: MRI segmentation offers crucial insights for automatic analysis. Although deep learning-based segmentation methods have attained cutting-edge performance, their efficacy heavily relies on vast sets of meticulously annotated data. Methods: In this study, we propose a novel semi-supervised MRI segmentation model that is able to explore unlabeled data in multiple aspects based on various semi-supervised learning technologies. Results: We compared the performance of our proposed method with other deep learning-based methods on 2 public datasets, and the results demonstrated that we have achieved Dice scores of 90.3% and 89.4% on the LA and ACDC datasets, respectively. Conclusions: We explored the synergy of various semi-supervised learning technologies for MRI segmentation, and our investigation will inspire research that focuses on designing MRI segmentation models.


Introduction
MRI (magnetic resonance imaging) segmentation refers to the process of partitioning or dividing MR images into different regions or structures based on their characteristics, such as tissue types, anatomy, or pathological features.This precise segmentation offers crucial insights to clinicians for accurate diagnosis, disease tracking, and treatment planning.Recent advancements in neural networks have led deep learning-based approaches to achieve state-of-the-art performance in various MRI segmentation tasks [1][2][3][4].
However, the efficacy of deep learning approaches hinges on large-scale dense annotations, which are extremely expensive and labor-consuming to obtain.This presents an important barrier to implement deep learning-based MRI segmentation models in real-world medical settings.Such a challenge has spurred research into semi-supervised techniques, which aim to train models using a small amount of labeled data combined with abundant unlabeled data, while still achieving satisfactory performance [5][6][7][8][9][10][11][12][13][14][15][16].
Within the realm of semi-supervised learning for MRI segmentation, advancements have given rise to 2 primary streams, namely, pseudo-labeling [5,6,13] and consistency regularization [7,8,10].Pseudo-labeling methods optimize models by extrapolating from a limited set of labeled data to generate pseudo-labels for the unlabeled data.These labels are then amalgamated with the labeled dataset for model refinement.On the other hand, consistency regularization techniques enforce consistency between predictions of different perturbations, such as data augmentations [7] and model perturbing [8].While pseudo-labeling harnesses information from unlabeled data, consistency regularization bolsters the stability and generalization of deep learning models.However, the synergy between these technologies-integrating pseudo-labeling and consistency regularization to maximize the utilization of unlabeled data-remains an underexplored territory in MRI segmentation, not to mention some other semi-supervised learning technologies, such as those based on generative networks [17,18].Efforts to fully leverage the complementary strengths of various semi-supervised learning technologies for optimized utilization of unlabeled data in MRI segmentation require further exploration and investigation.Furthermore, while there exists a range of semi-supervised MRI segmentation models, they predominantly adopt an encoderdecoder architecture.Therefore, in this study, we introduce an innovative and adaptable training approach for semi-supervised MRI segmentation.This approach offers dual benefits: (a) First, it merges the strengths of various semi-supervised learning techniques, fostering their synergistic influence on model refinement.(b) Second, it is designed to seamlessly integrate with encoder-decoder-based semi-supervised MRI segmentation models, ultimately enhancing their performance.

MRI datasets
In this study, we validated our method using 2 public MRI segmentation datasets [19,20].The requirement for written informed consent from enrolled patients was waived due to the retrospective design of the study.
The Left Atrial (LA) dataset: The LA dataset [19] was obtained from the 2018 Left Atrium Segmentation Challenge (https://www.cardiacatlas.org/atriaseg2018-challenge/atria-segdata/) and it includes 100 three-dimensional (3D) gadoliniumenhanced MRI scans collected by 2 clinical MR scanners (Avanto 1.5T and Verio 3.0T, Siemens, Germany) in patients with atrial fibrillation (AF) ranging in time from 3 to 27 months after clinical ablation at the University of Utah.As provided by the competition organizing committee, 3D MRI scans have a spatial resolution of 0.625 × 0.625 × 0.625 mm 3 and a spatial size of 576 × 576 × 88 pixels or 640 × 640 × 88 pixels.The LA cavity is defined as the pixels within the surface of the LA endocardium, including the mitral valve, the LA auricles, and the sleeve extent of the pulmonary veins (PVs).Three trained observers performed manual segmentation of each late gadolinium enhancement MRI (LGE-MRI) scan using Corview image processing software [Merrk Inc., Salt Lake City, UT (McGann et al., 2014)].The grayscale LGE-MRI image volumes and associated binary LA segmentations were stored in the nearly raw raster data (nrrd) format.Following previous studies [8][9][10], the dataset was split into a training set and a testing set at a ratio of 8:2.In the training dataset, the labeled proportion p was set to 10% or 20%, and the corresponding number of patients randomly selected as labeled patients with images and labels available (16 patients for p = 20%, 8 patients for p = 10%, all slices within one patient were simultaneously set to labeled or unlabeled), while others as unlabeled patients with only images available for training.Given that the existing research did not separate the validation set, we took the initiative to divide the initial training set into a fresh training set and a dedicated validation set.Initially, we trained the model using various configurations (e.g., p = 10% or 20%) based on this new training data.Subsequently, we assessed the model's performance using the validation set.Upon determining the optimal hyperparameters from this validation process, we retrained the model using these parameters on the original training data.Finally, we rigorously tested our model on the designated testing set to ensure a fair and comprehensive comparison with other existing models.
The ACDC dataset: The ACDC dataset [20] was sourced from the 2017 Automated Cardiac Diagnosis Challenge (ACDC) (https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html).This dataset contains MR images of 100 subjects, including 20 healthy volunteers and 80 patients divided into 5 subgroups on average: previous myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle.Each group was distinctly defined based on physiological parameters, including left or right diastolic volume, ejection fraction, local contraction of the left ventricle (LV), LV mass, and maximum thickness of the myocardium.Acquisitions were performed using 2 MRI scanners with different magnetic field strengths (Area 1.5T, Siemens, Germany and Trio Tim 3.0T, Siemens, Germany) over a period of 6 years in the Hospital of Dijon (France).Cinematic MR images were acquired retrospectively or prospectively gated under breath-hold using conventional steady-state free procession (SSFP) sequences.After acquisition of long-axis slices, a series of short-axis slices were acquired covering the LV from base to apex, with slice thicknesses ranging from 5 to 10 mm (usually 5 mm).Spatial resolution varied from 1.34 to 1.68 mm 2 /pixel.Depending on the patient, 28 to 40 volumes were acquired to completely cover (retrospectively gated) or partially cover (prospectively gated) one cardiac cycle.The complete dataset was acquired during a clinical routine, which resulted in natural variations in image quality, field-of-view variations, and overall or nearly overall coverage of the LV.The dataset was divided into the training set, validation set, and testing set at a ratio of 7:1:2, following previous studies [6].In the training dataset, the labeled proportion p was set to 5% or 10%, and the corresponding number of patients was selected as labeled patients with images and labels available (7 patients for p = 10%, 3 patients for p = 5%, all slices within one patient were simultaneously set to labeled or unlabeled), while others as unlabeled patients with only images available for training.

Method overview
The overall architecture of our method is shown in Fig. 1.It employs 3 strategies to capitalize on unlabeled data within existing encoderdecoder-based MRI segmentation models.Initially, pseudolabels are created by leveraging predictions from unlabeled data, facilitating model refinement using both labeled and unlabeled datasets.Second, feature perturbation is applied to the input features directed to the decoder, ensuring output consistency with the corresponding pseudo-label.Last, to cultivate richer feature representations, an auxiliary reconstruction task is introduced.
Here, we elaborate on each step to delineate their specific functionalities in the following sections.

Pseudo-labeling branch
We denote the input MRI volumes as X , which comprises labeled data X L and unlabeled data X U , i.e., X = {X L , X U }. Correspondingly, The ground-truth labels for X are denoted as Y = {Y L , Y U }. Y is voxel-wise, and each element in Y represents a manually annotated segmentation map corresponding in shape to its input volume.Notably, Y U denotes the manually annotated segmentation for X U , which is not available during training.
In the pseudo-labeling branch (as shown by the black arrow in Fig. 1), we apply the standard cross-entropy loss on both labeled and unlabeled data, and the Dice loss [21] is also used on the labeled data.Concretely, assuming that there are N 1 and N 2 MRI slices in X L and X U , respectively, the total loss from pseudo-labeling, denoted as  p , is derived by summing the supervised loss  s on labeled slices and the unsupervised loss  u on unlabeled slices. s integrates the standard cross-entropy loss and Dice loss, expressed as follows: where ( ⋅ ,⋅ ) represents the standard cross-entropy function, while ( ⋅ ,⋅ ) symbolizes the Dice loss.H and W denote the (1) height and the width of the input, respectively.Additionally, P i n indicates the predicted value at the pixel of i in the nth labeled slice, and Y (i,j) is the corresponding ground truth. u denotes the unsupervised loss and is a variant of the cross-entropy loss, which can be defined as follows: where 1 [•] represents the indicator function that excludes predictions with maximal confidences below the pre-set threshold τ 1 .The term P i n indicates the prediction at the pixel of i in the nth unlabeled slice, while ŶU i n = argmax P i n corresponds to the derived one-hot pseudo-label.The threshold τ 1 is set to 0.9 by default to ensure the reliability of the pseudo-labels generated.

Consistency regularization branch
To impose consistency constraints, illustrated by the blue arrow in Fig. 1, we introduce perturbations to the features supplied to the decoder.Subsequently, we align predictions made on these perturbed features with the predictions derived from the original, undisturbed features.Specifically, as depicted in Fig. 2 and given N slices, the formulation of consistency regularization is expressed as: where h(•) represents the convolutional layer preceding the prediction within the decoder, I n signifies the feature map fed into h(•), and g(•) denotes the perturbation function that introduces noise, drawn from a standard normal distribution, to each element in I n with a possibility of 5%.

Reconstruction branch
In recent years, deep generative methods have emerged as powerful tools for exploring the underlying distribution of training samples.Taking inspiration from the conditional variational auto-encoder (cVAE) [17] framework, we introduce an auxiliary reconstruction branch aimed at learning meaningful representations.As illustrated in Fig. 1, the encoder initially maps input data to a latent space.Subsequently, the decoder uses these latent space representations to reconstruct the input data.Specifically, as depicted in Fig. 3, the latent variable z is derived from the approximate posterior distribution q(z| X), which takes μ and σ as inputs, where μ and σ denote the mean and SD, respectively.Besides, as auxiliary noise variable ϵ is sampled from a standard normal distribution, the final computation of z is computed as: Utilizing the output of the encoder, denoted as E(X), we derive μ and log(σ 2 ) through 2 distinct fully connected neural network layers, independently processing the input E(X).Subsequently, the decoder takes z and E(X) as inputs, wherein E(X) undergoes reshaping and concatenation with z.Ultimately, the reconstruction is generated by passing the output of the decoder through a convolutional layer.Additionally, we also adopt the same mean squared error (MSE) loss in Eq. 3 to constrain predictions.

Teacher-student architecture
The teacher-student network, a highly regarded model architecture, is widely used for semi-supervised segmentation tasks.[7,[22][23][24].Building upon the framework of CTSSeg [7], we tailor a teacher-student architecture for our approach, as depicted in Fig. 4.Both the teacher and student networks share the same underlying structure, with the exception that the student network incorporates a Projector-a convolutional layer designed to manipulate features.In contrast to the student network, the teacher network updates its parameters (ϕ) not through gradient back-propagation but by using an exponential moving average of the student's parameters θ [25].Specifically, with a decay rate η ranging from 0 to 1, the parameter updates for the teacher network are defined as follows: where i denotes the ith iteration.By default, we set η = 0.95 following [7].
It should be noted that within this teacher-student architecture, the learning process of the student network is guided by pseudo-labels generated by the teacher network.

Implementation details
Input processing: For the LA dataset, the raw images were converted into a 3-channel PNG format, with spatial resolutions of either 576 × 576 or 640 × 640, and a Z direction depth (slice count) of 88.For the ACDC dataset, the raw images were converted into 8-bit grayscale PNG format, varying in size across different patients.The spatial resolutions ranged from 224 × 154 to 512 × 428, and the Z direction depth ranged from 6 to 18.All individual raw MRI data were normalized using their mean and SD of color intensity.To optimize the training of our network, we undertook specific preprocessing and data augmentation measures.For the LA dataset, the challenge was the minimal proportion of positive voxels, making it tough to train the network directly with these data.To address this, we resized and center-cropped each 3D volume to exclude the background's outer edges.Furthermore, we implemented random cropping to create 128 × 128 × 32 patches.This step was crucial for mitigating the imbalance between positive and negative voxels while (7)  simultaneously reducing the demand on graphics processing unit (GPU) memory and computational resources.For the ACDC dataset, the preprocessing step was simpler, involving resizing the inputs to a consistent dimension of 256 × 256.Our data augmentation tactics included applying random rotations and flips to both datasets, along with color jittering for the LA dataset.These methods aim to increase the sample variety, aiding in the reduction of overfitting.
Training schedule: All the experiments were performed on a deep learning server featuring Intel Xeon Gold 6230 CPU, 251 GB RAM, and 6 NVIDIA GeForce RTX 3090 GPUs (24 GB).The models were developed based on Python.The model weights were randomly initialized, and the network was trained from scratch using the adaptive moment estimation (Adam) optimizer.The batch size is set to 4 for the LA dataset and 24 for the ACDC dataset, while the number of epochs is set to 240 for LA and the iterations are set to 30,000 for ACDC.The base learning rate is set 1 × 10 −4 for LA and 5 × 10 −4 for ACDC with a cosine annealing scheduler with the maximum number of iterations equal to training epochs and minimal learning rate 1 × 10 −7 , and weight decay is set 1 × 10 −4 for both LA and ACDC.In accordance with [7], the self-supervised loss is weighted by a Gaussian ramp-up weight, expressed as δ = exp (−5(1 − min (t, T ramp )/T ramp ) 2 ), where t denotes the current training step and T ramp represents the duration of the ramp-up period, equivalent to one quarter of the total training iterations.In each training batch, we endeavored to maintain an even distribution of labeled and unlabeled data.Specifically, for the LA dataset, we duplicated the labeled data to match the volume of unlabeled data, followed by random sampling from the full dataset.In contrast, for the ACDC dataset, we utilized a method called TwoStreamBatchSampler [6], which allows for the sampling of an arbitrary mix of labeled and unlabeled data in each batch, promoting a balanced learning environment.

Statistical analysis
Following common practice [5,6], Dice score, intersection over union (IoU) score, 95% Hausdorff distance (95HD), and average surface distance (ASD) were used to assess the segmentation results of each model with the manual labels.
ASD [i.e., ASD(M1, M2)] is defined as: where s(•) denotes the length of the mask contour.In summary, the Dice score and IoU score are metrics used to assess the level of overlap between predicted areas and manually labeled areas, emphasizing the internal consistency of predictions.A higher score indicates better prediction accuracy.( 8) Overview of the teacher-student architecture for our approach.
Conversely, the 95HD and ASD metrics measure the distance between predicted results and manual labels, with a particular focus on segmentation boundaries.In this context, smaller values signify more accurate delineation.

Results
In contrast to the LA dataset, the ACDC dataset typically comprises fewer MRI slices.As a result, prevailing research commonly employs 2D models for ACDC while opting for 3D models in the case of LA.In this study, we developed our approach utilizing 3D-UNet [26] for LA and UNet [27] for ACDC.

Comparison with other methods
Comparative performance of our method on LA: The comparative performance of various models on the LA dataset is detailed in Table 1.The efficacy of the supervised 3D-UNet severely relies on densely annotated data, evident in its substantial performance decline from a 91.5% to 80.1% Dice score when labeled data reduce from 100% to 10%.In contrast, our method demonstrates a marked enhancement in the 3D-UNet's performance with just 10% labeled data.Specifically, 3D-UNet equipped with our approach achieves a 91.0%Dice score and a 84.2% IoU score, showcasing an improvement of +9.9%/16.6%over the standard 3D-UNet.
In another setting, where only 20% labeled data are available, our method yields even better results compared to a standard 3D-UNet trained on the complete labeled dataset.It exhibits a noteworthy +8.7%/+13.3%enhancement over the basic 3D-UNet in terms of Dice score and IoU score, respectively.Additionally, it shows a modest increase of +0.6%/+2.1% compared to the full-set 3D-UNet.Notably, our approach emerges as the superior performer across both 20% and 10% labeled data settings compared to other methods, consistently delivering the best results overall.
Comparative performance of our method on ACDC: Table 2 shows the comparison results of different models on the ACDC dataset.Different from the LA dataset, current methodologies usually report results for settings where 10% and 5% labeled data are available.Notably, our method consistently enhances the performance of the basic UNet model across both settings, consistently delivering superior overall results.Again, it is worth noting that our approach achieves top performance in all aspects among other MRI segmentation methods.
Comparative performance of our method with foundation models: We also conduct comparisons with the recently popular foundation models exemplified by the SAM series [28][29][30].Following the approach of MedSAM [30], we fine-tune these foundation models utilizing the bounding box prompts based on labeled data.It is worth noting that SAM-Med2D [28] has been pre-trained on both the LA and ACDC datasets; thus, we refrain from further fine-tuning SAM-Med2D on these datasets.Similarly, we opt not to fine-tune MedSAM using the ACDC dataset, given its prior pre-training on this dataset.Our results, as presented in Tables 1 and 2, consistently demonstrate the superiority of our method over all foundation models, even when utilizing only 10% of labeled data.This holds true even in scenarios where SAM and MedSAM are fine-tuned using the entirety of the labeled data.

Ablation study
In Table 3, the performance breakdown of our approach in ablation experiments is presented.These experiments were conducted on the LA dataset under settings with 20% and 10% labeled data availability, respectively.Our method comprises 3 integral streams: consistency regularization, pseudo-labeling, and reconstruction.Notably, the absence of any stream results in a performance decline across all evaluation metrics in both settings.It is evident that the complete model consistently achieves the highest performance across all metrics in comparison to individual stream exclusions.
To verify the robustness of our approach, we train our model 10 times with different sampled semi-supervised data settings, and the margin of error is presented by the SD across these diverse runs.The results in Table 4 show the robustness of our approach.
In Table 5, we examine the impacts of the teacher-student architecture.Evidently, the performance combined with the teacher-student architecture outperforms its non-teacherstudent counterpart in Dice by 1.8% and 1.6% on the 20% and 10% labeled data settings, respectively.

Qualitative evaluation
Figure 5 presents a comparative analysis between segmentations generated by 2 kinds of models: the 3D-UNet trained with varying amounts of labeled data and the 3D-UNet integrated with our method specifically designed for scenarios with limited labeled data (20% and 10% availability).
When only 20% of labeled data are accessible, the performance of the 3D-UNet noticeably diminishes in contrast to its performance when trained on a complete set of labeled data (as seen in the third row of Fig. 5).However, our method integrated into the 3D-UNet demonstrates the capability to predict segmentations that closely resemble those produced by the 3D-UNet trained based on 100% of labeled data (as depicted in the fourth row of Fig. 5).
In the more challenging scenario where only 10% labeled data are available, our approach substantially improves the performance of the 3D-UNet (as depicted in the last 2 rows of Fig. 5), showcasing its effectiveness in enhancing segmentation outcomes in scenarios of severely limited labeled data.

Local clinical dataset
Study population: This study was approved by the Medical Ethics Committee of Peking University Third Hospital Review Board (IRB00006761-M2021167).The inclusion criteria: Healthy males aged between 18 and 45 years, with a height ranging from 160 to 175 cm and a BMI (body mass index) within the normal range for males (18.5 to 23.9 kg/ m 2 ), were enrolled in the study.MRI was performed at baseline (month 0), and follow-up MRI was conducted at the 3rd month and the 9th month after enrollment.The exclusion criteria as follows: (a) MR image was not clear enough to analyze and (b) the volunteers voluntarily withdrew from the experiment.
Patient characteristics: A total of 37 volunteers' data, amounting to 111 sets of data, were included in the experiment.Figure 6 illustrates the histogram of age distribution.The average age of the volunteers was 31.65 years, with an average height of 168.51 cm, an average weight of 63.32 kg, and an average BMI of 21.79.The details of patients' characteristics mentioned above are summarized in Table 6.
MRI parameters: MRI was performed on a 3.0-T scanner (Discovery 750w, GE Healthcare, WI), using a 3D volumetric interpolated spoiled gradient echo (3D-VIBE) sequence with T1-weighted imaging and Dixon technique for fat-water signal separation to obtain T1-weighted fat-suppressed images.The sequence parameters are as follows: repetition time (TR): 5.4 ms; echo time (TE): 2.1 ms; flip angle: 12 ∘ ; slice thickness: 4 mm; scanning field of view: 420 mm; reconstruction matrix size: 512 × 512; bandwidth: 558 Hz/voxel.A GEM (gradient echo with multi-shot acquisition) matrix coil was used, with Table 1.Performance comparisons of different methods on the LA dataset.The best results are highlighted in bold font."-" for SAM-Med2D [28] means that this model has been pre-trained using LA, while "-" for other methods denotes that they did not report the corresponding results.Notably, BCP [44], ACTION [45], and ACTION++ [46] did not report results on the setting of 20% labeled data.LG-ER-MT [16] 0.855 0.751 13.29 3.77 GBDL [5] 0.884 0.792 5.89 1.60 SimCVD [47] 0.890 0.803 8.34 2.59 CMM [48] 0.860 --3.24BCP [44] 0.896 0.813 6.81 1.76 ACTION [45] 0.887 --2.1 ACTION++ [46] 0.899 --1.74 Ours 0.910 0.842 3.28 1.02 Table 2. Performance comparisons of different methods on the ACDC dataset.The best results are highlighted in bold font."-" for MedSAM [30] and SAM-Med2D [28] means that these models have been pre-trained using ACDC, while "-" for other methods denotes that they did not report the corresponding results.the scanning field of view extending from the anterior superior iliac spine to the calcaneus, and an axial section was acquired in segments.Each acquisition consisted of 100 slices (approximately 40 cm), with an overlap of more than 15% between adjacent acquisition ranges.Dataset split: There are 255 MR images in total, and muscle groups were delineated in 10 manual regions of interest (ROIs) as detailed in Table 7.For training and validation, we selected 180 and 30 MR images, respectively, with an average of 70.42% of 2D slices labeled in each image.The remaining 45 MR images, each fully labeled, constituted the test set.

Implementation details
Input processing: We implemented the spatial transformation component, which encompasses elastic deformation [27], random cropping, and random rotation.The cropping dimensions were established at 64 × 224 × 224.Subsequently, we applied voxelwise transformations as described in [7].
Training details: We employ the Adam optimizer, initiating an epoch, which is warmed up by a linear learning rate, followed by a cosine annealing schedule across 15 epochs.The base learning rate is set at 1 × 10 −4 , and the batch size is set at 2. Additionally, the self-supervised loss is weighted by a factor as described in [7]: δ = exp (−5(1 − min (t, T ramp )/T ramp ) 2 ), where t represents the current training step and T ramp is the duration of the ramp-up period, equivalent to one-quarter of the total training iterations.

Experimental results
We perform experiments using both the 2D and 3D versions of the UNet [26,27] as our baseline.As shown in Table 8, when equipped with our method, 2D-UNet and 3D-UNet achieve 87.3% and 89.5% Dice scores, which are +2.8%/+3.3%higher than their counterparts.Besides, 3D-UNet equipped with our method achieves 92.8% Dice score, delivering the best performance.We also provide a qualitative analysis as shown in Fig. 7.It is evident that both the 2D-UNet and 3D-UNet, enhanced with our method, exhibit superior performance compared to their standard counterparts.Particularly noteworthy is our method's ability to predict missing annotations, as demonstrated in the second row, even when some manual annotations are absent.

Discussion
In this study, a novel semi-supervised MRI segmentation method was proposed and validated across 2 distinct public datasets and one private dataset.Notably, when the labeled data are limited, our approach enables 3D-UNet/UNet to deliver competitive performance in contrast to the counterparts trained on the full label set.
Semi-supervised MRI segmentation models aim to minimize dependence on labeled data while maintaining commendable performance levels.The essence of these models lies in effectively harnessing a significant volume of unlabeled data.In contrast to existing semi-supervised MRI segmentation approaches, our proposed method excels in its intricate utilization of unlabeled data across various dimensions.Primarily, we adopt a pseudo-labeling strategy to derive pseudo-labels for unlabeled MRI scans.These pseudolabels function akin to ground-truths, empowering the model to glean insights from these supplementary data.Numerous computer vision studies have proven that the integration of unlabeled data through pseudo-labeling typically results in performance enhancements [6,13,[31][32][33][34][35][36][37], because the model gains a broader perspective from the expansive unlabeled dataset, learning more generalized features that might not be present in the smaller labeled dataset.Second, we introduced a straightforward yet efficient consistency regularization stream to bolster the performance.Consistency regularization aims to make the model's predictions robust and stable by ensuring that similar inputs (either from augmented versions of the same data or different views of the same instance) produce similar predictions [7,8,10,32,[38][39][40][41][42].It encourages the model to learn more generalizable representations that are less sensitive to minor input alterations.Last, we proposed a reconstruction stream to further explore the distribution of the training data.In semi-supervised learning, generative techniques involve training models to generate new data samples that resemble the distribution of the original dataset.These methods use generative models, such as variational autoen-(VAEs) [17,43], to accomplish this feat.We build our reconstruction module based on a conditional variational autoencoder (cVAE) [17], enforcing the model generating synthetic samples that simulate the input samples conditioned on those inputs.By capturing the statistical properties of the dataset, our reconstruction stream assists the model in discerning the underlying data distribution.
Benefiting from the design of different strategies for leveraging unlabeled data, UNet/3D-UNet equipped with our approach could achieve competitive performance in contrast to the counterparts that trained in the full labeled set when the quantity of labeled data is scarce (e.g.only 10% labeled MRI scans are available).Besides, our approach achieves the best overall performance among the current semi-supervised MRI segmentation methods.As shown in Results, each of our proposed strategies to harness unlabeled data has been proven to be effective and the synergy between these strategies is evident, with each addition enhancing the performance of the others.This strongly confirms our hypothesis that different strategies for leveraging unlabeled data can complement each other.

Limitations
First, our model was not tested on MRI scans of a wide range of tissues, as the current popular public benchmarks used in this study are all about the heart and the private dataset is about the muscle.However, our method should have a broad applicability without requiring major changes, because the technologies proposed in this study are independent of segmented objects.Therefore, future research endeavors could prioritize adapting our network for segmenting MRI scans of different tissues.
Second, while our study primarily focused on UNet/3D-UNet, our method has the potential to integrate with various encoderdecoder architectures by adapting certain parameters.Hence, there remains an interest in exploring how our approach could enhance a broader spectrum of MRI segmentation models.
Finally, there are potentially more semi-supervised technologies under exploration, such as semi-supervised generative adversarial network (GAN)-based methods.Initial findings in this study demonstrate the mutual reinforcement among different semi-supervised learning strategies.Therefore, we envision that purposefully designing additional streams of semisupervised technology can further amplify the performance of our method.

Conclusion
This study developed a novel semi-supervised MRI segmentation method that explores unlabeled data in various aspects.Our findings demonstrate the effective synergy between distinct semi-supervised technologies, significantly reducing the dependence of UNet/3D-UNet on annotated data while preserving satisfactory performance levels.

Table 3 .
Ablation studies of different components in our method, without the teacher-student architecture, on the LA dataset.The best results are highlighted in bold font.

Table 4 .
Margin of errors (SD) on the LA and ACDC datasets

Table 5 .
Ablation studies of the teacher-student architecture on the LA dataset.The best results are highlighted in bold font.

Table 7 .
Description of each ROI in the local dataset S.R., symbolic representation; R., ROI

Table 8 .
Performance comparisons of different methods on the local clinical dataset are presented, with results reported as average Dice scores.The best results are highlighted in bold font.

Table 6 .
The characteristics of patients .Q.L. and D.Y. were involved in data acquisition and analysis.P.W. was involved in manuscript revision.and D.L. supervised the research.Competing interests: H.H. is an editorial board member of the Health Data Science journal, and the authors declare no other competing interests. experiments